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Abstract 

Let  the  m x n linear  model 

y — Kx  + e 

be  obtained  by  discretizing  a system  of  first-kind  integral  equations 

Vi  = f Ki(£)x(()  d£  + ii  , i — 1, . . . , m , 

J a 

with  known  functions  Ki(£),  unknown  function  z(£),  and  an  m-vector  y of  measurements  corrupted  by 
random  errors  e drawn  from  a distribution  with  zero  mean  vector  and  known  variance  matrix.  Consider  the 
problem  of  estimating  linear  functions  of  the  form 


4>  = wT  x 


[ H0x(t)d£  . 

J a 


where  u;(£)  is  an  averaging  function  designed  to  elicit  some  desired  information  about  z(£).  For  such  prob- 
lems, the  least  squares  solution  is  a highly  unstable  function  of  the  measurements,  and  the  classical  confidence 
intervals  are  too  wide  to  be  useful.  The  solution  can  often  be  stabilized  by  imposing  physically  motivated,  a 
priori  nonnegativity  constraints  on  x.  This  paper  will  show  how  to  extend  the  classical  confidence  interval 
estimation  procedure  to  accommodate  these  nonnegativity  constraints  in  order  to  obtain  improved  confidence 
intervals.  The  technique  defines  valid  confidence  intervals  even  for  problems  with  m < n. 
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1.  Introduction 

Consider  the  linear  model 

y ~ Kx  + e , (1.1) 

where  y is  a measured  m-vector,  K is  a known  m x n matrix,  x is  an  unknown  n-vector,  and  e is  a random 
77>vector  with  zero  mean  and  known  positive  definite  variance  matrix  S2 , i.e., 

5(e)  = o,5  (eeT)  = S2  . (1.2) 


We  are  particularly  interested  in  the  case  when  this  model  arises  in  discretizing  a system  of  first  kind  integral 
equations 


Vi  - J Ki(()x(()d^  + ii  , * = 1,  — , 


m 


(1.3) 


We  assume  that  n is  chosen  large  enough  to  assure  that  the  discretization  errors  are  negligible  in  comparison 
with  the  Ci. 

Without  severe  a priori  restrictions  on  x(£),  the  m discrete  relationships  (1.3)  cannot  determine  the  value 
of  z(£)  at  every  point  on  the  interval  [a,  b].  Therefore  we  seek  instead  to  estimate  a finite  set  of  average 
values  of  i(£),  averaged  on  various  subintervals  of  [a,  6].  More  precisely,  we  choose  p representative  points 
£i  < £2  < • • • < £p  in  [a,  b ] and  for  each  (k  choose  a corresponding  window  function  wk(s)  designed  so  that 


(1.4) 


is  an  average  value  of  the  unknown  function  in  a subinterval  around  £*.  Discretizing  these  averaging  integrals 
by  the  same  method  used  to  discretize  (1.3)  gives  a set  of  p linear  functions 


<t>k  - wTkx  , k = l,2,...,p  , 


(1.5) 


which  approximate  the  desired  averages.  We  are  again  assuming  that  the  discretization  errors  are  negligible 
in  comparison  to  the  Note  that  the  individual  components  of  x can  be  specified  by  choosing  p — n and 


wk  - ek  , k = 1,2, ...  ,n  , 
where  ek  is  the  fcth  column  of  the  identity  matrix. 


(1.6) 
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Having  accomplished  the  discretization,  the  task  becomes  one  of  estimating  each  of  the  p linear  functions 
(1.5)  subject  to  the  statistical  constraints  (1.1)  and  (1.2).  If  rank(.K')  = n,  then  these  are  classical  linear 
regression  problems  with  the  well  known  best  linear  unbiased  (least  squares)  estimates 

h = wTkx  , x = (KTS-’K)~'KTS-3y  . (1.7) 

When  each  of  these  point  estimates  is  associated  with  the  corresponding  subinterval  about  £*. , the  resulting 
set  of  pairs  comprises  a discrete  approximation  to  the  unknown  function  x(£).  The  reliability  of  such  an 
approximation  can  be  assessed  by  computing  confidence  interval  estimates  \<f>lk  , for  each  of  the  (pk.  There 
are  two  kinds  of  confidence  intervals  that  can  be  considered  [8,  Chapt.  6]: 

one-at-a-time  confidence  intervals  : Each  <f>k  is  treated  individually,  and  for  a pre-specified  probability 
a (0  < a < 1),  the  100a%  confidence  intervals  are  determined  separately.  Each  individual  interval 
contains  the  corresponding  true  value  with  probability  a,  i.e. , 

Pr  {<t>[°  < < <t>ukv  } - a , k — 1,2  ,...,p  . (1.8) 

Repeating  the  measurements  a large  number  of  times  M,  and  calculating  the  full  set  of  p intervals 
each  time,  would  give  roughly  aMp  intervals  containing  the  corresponding  true  values  and  roughly 
(1  — a)Mp  failing  to  contain  the  true  value. 

simultaneous  confidence  intervals  : All  of  the  <f>k  are  treated  together,  and  the  intervals  are  constructed 
so  that,  with  probability  a,  they  all  contain  the  corresponding  true  values,  i.e., 

Pr  { <t>[°  < 4>k  < 4>7  , A = l,2,...,p}>a  . (1.9) 

Repeating  the  measurements  M times,  and  calculating  a full  set  of  p confidence  intervals  each  time, 
would  give  roughly  aM  sets  in  which  all  of  the  intervals  contain  the  corresponding  true  values,  and 
roughly  (1  — a)M  sets  containing  at  least  one  interval  which  fails  to  contain  the  corresponding  true 
value.  It  is  clear  that  this  is  a more  exacting  requirement  than  that  for  one-at-a-time  intervals,  so  it 
is  not  surprising  that,  for  a given  a,  each  of  the  simultaneous  intervals  is  wider  than  its  one-at-a-time 
counterpart. 

The  inversion  of  first  kind  integral  equations  is  an  ill-posed  problem,  so  the  matrix  K is  usually  ill- 
conditioned  or  even  rank  deficient.  As  a result  the  confidence  intervals  are  typically  extremely  wide  or  even 
unbounded.  To  get  physically  plausible  solutions,  it  is  necessary  to  impose  a priori  constraints  on  the  solution. 
In  many  real-world  problems  the  solution  is  known  to  be  nonnegative,  and  good  algorithms  for  computing 
nonnegatively  constrained  point  estimates  have  been  widely  available  for  some  time  [10] . In  previous  work 
[12]  we  considered  the  problem  of  computing  nonnegatively  constrained,  simultaneous  confidence  intervals 
when  e is  normally  distributed.  We  showed  that  the  confidence  intervals  can  be  computed  by  solving  certain 
constrained  optimization  problems  and  presented  an  algorithm  for  solving  them.  In  this  paper  we  extend 
those  results  to  include  nonnegatively  constrained  one-at-a-time  intervals. 

In  §2  we  present  two  example  problems  used  throughout  the  paper  to  illustrate  our  results.  In  §3 
we  briefly  review  classical  point  estimation  theory  with  an  emphasis  on  the  effects  of  ill-conditioning  and 
rank-deficiency.  In  §4  we  briefly  review  classical,  one-at-a-time  confidence  intervals,  and  in  §5  we  consider 
the  geometrical  basis  for  interval  estimation  in  order  to  characterize  the  bounds  as  solutions  to  certain 
constrained  optimization  problems  which  will  later  be  augmented  to  include  the  nonnegativity  constraints. 
In  §6  we  briefly  review  the  estimation  of  classical,  simultaneous  confidence  intervals,  whose  geometrical  basis 
is  identical  to  that  for  the  one-at-a-time  intervals.  The  purpose  of  the  discussion  in  these  initial  sections 
is  to  note  the  deficiencies  of  the  classical  bounds  and  point  estimates  when  used  for  discretizations  of  ill- 
posed  problems.  In  §7  we  formally  append  the  constraints  x > o to  the  classical  linear  regression  model 
and  demonstrate  their  effectiveness  in  stabilizing  the  point  estimates.  We  also  show  that  these  constraints 
almost  always  produce  bounded  confidence  intervals,  even  for  underdetermined  problems,  and  show  how  to 
extend  the  theory  for  simultaneous  intervals  to  accommodate  them.  The  extension  of  the  classical  theory 
for  one-at-a-time  intervals  is  considerably  more  difficult,  but  in  §8  we  prove  a conjecture,  first  made  some  20 
years  ago  by  Walter  R.  Burrus  [17],  which  allows  the  calculation  of  such  intervals  using  the  same  algorithm 
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we  originally  developed  for  simultaneous  intervals.  Finally,  a real-world  example  involving  the  measurement 
of  a gamma-ray  spectrum  is  presented  in  §9.  Since  we  assume  that  the  only  significant  errors  are  the  random 
perturbations  e in  the  measured  data,  we  implicitly  assume  that  the  model  is  correct.  In  particular,  this 
implies  that  errors  due  to  discretization  of  the  integrals  are  negligible.  Thus  we  produce  confidence  intervals 
corresponding  to  the  class  of  functions  for  which  the  quadrature  is  sufficiently  accurate  (or  the  instrument 
resolution  sufficiently  fine). 


2.  Test  Problems 

To  generate  problems  with  known  solutions,  we  use  the  integral  equation  studied  by  Phillips  [14] 

y(t)  = j , -6  < t < 6 


where 


and 


K(t,  0 = 


1 + cos 


3 


0 


, | £ -t  |<  3 , 1 1 |<  6 , 

, otherwise  , 


y(0 


_ | (6-  I * I)  [l  + 3 cos  (f )]  + £ sin  , |t|<6, 

0 , otherwise  . 


(2.1) 


(2.2) 


(2.3) 


The  kernel  K(t,£)  is  nonnegative,  a property  common  in  many  real-world  applications.  The  function  y(t) 
is  also  nonnegative,  symmetric  about  t = 0,  and  bell-shaped,  with  maximum  value  y(0)  = 9 and  minimum 
value  y(—6)  — 0 = y(+6).  The  exact  solution  is 


*(0 


1 + cos^^j  , | £ |<  3 , 

, otherwise  . 


0 


(2.4) 


This  nonnegative  function  also  defines  a symmetric  bell-shaped  curve  with  a maximum  z(0)  = 2 and  minima 
z(-3)  = 0 = x(+3). 

To  get  a system  of  integral  equations,  we  chose  m mesh  points 


- 6 < tx  < <2  < t3  < • • • < im_i  < tm  < +6 
and  define  Kt( £)  = K(ti,£).  This  gave 


Vi  = y(U)  = J Ki(t)x(()d£  , i — 1,2,..., 


m 


We  used  the  trapezoidal  rule  on  the  mesh 

£i  = — 3 , £2  = —2.95  , £3  — —2.90  , • • • , ^6i  = 0 , • • • , £12i  - 3 
to  reduce  this  system  of  integral  equations  to 

y — Kx  , 


(2.5) 


(2.6) 


(2.7) 


(2.8) 


where  the  are  calculated  from  (2.3)  and  the  Xj  = x(£j)  are  calculated  from  (2.4).  In  order  to  assure  that 
every  row  of  the  matrix  K subtends  at  least  one  quadrature  panel,  we  chose  the  t-mesh  to  be  equally  spaced 
on  the  interval  [—5.925,  5.925]: 


11.85 


U = —5.925  -\ (i  — 1)  , z = 1,  2, . . .,m  . 

m—1 


(2.9) 


In  order  to  obtain  test  problems  in  which  the  quadrature  errors  were  completely  negligible  relative  to  the 
statistical  errors,  we  did  not  use  the  y- vector  computed  from  (2.3),  but  rather  used  y defined  by 


y = Kx 


(2.10) 
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The  calculation  of  y was  double  precision,  with  results  rounded  to  single  precision  on  a machine  with 
fmoch  — 7 x 10-15.  The  vector  x can  be  regarded  as  the  exact  quantity  to  be  estimated,  and  the  vector  y 
can  be  regarded  as  the  measurements  obtained  without  error. 

To  get  the  random  “measuring”  errors  £{,  we  let  Sj  = (I0“6)i/i,  i — 1,2, . . .,  m,  and  picked  sample  vectors 
e from  a multivariate  normal  distribution  with  independently  distributed  elements,  i.e., 

e = (ti,e2,...,£m)T  ~ N(o  , S2  ) , (2.11) 

where  S2  = diag  ( s2  , s2  , • • • , ).  Adding  the  random  errors  chosen  in  this  manner  to  the  system  (2.10) 

gave 

y = Kx  + e , e ~ N(  o , S2  ) (2.12) 

where  y is  the  vector  of  “measured”  values.  Since  the  calculation  of  the  yx  assures  that  (2.10)  is  satisfied  to  14 
digits,  the  above  standard  deviations  are  all  from  6 to  8 orders  of  magnitude  greater  than  the  corresponding 
truncation  errors  that  arose  in  forming  y.  Thus  the  random  errors  were  essentially  the  only  errors  in  the 
problem. 

2.1.  Problem  I — An  Overdetermined  Problem 

For  the  first  test  problem  we  chose  m — 150  and  n = 121,  producing  a matrix  K of  rank  n.  For  statistical 
studies,  we  generated  several  such  problems  by  randomly  choosing  different  error  vectors  e,  but  picked  a 
typical  one  for  use  in  this  paper.  The  least  squares  solution  for  this  one  is  shown  in  the  top  of  Figure  1, 
together  with  the  exact  solution  x(s).  The  maximum  and  minimum  singular  values  of  the  scaled  matrix 
S~1K  are  cti  = 3.3950  x 109  and  a 121  = 1.1610  so  the  condition  number  is  cond(S-1.RT)  = 2.924  x 109. 
Since  emac/i  — 7 x 10 — 15 , we  could  reasonably  expect  to  compute  x accurate  to  6 digits.  This  means  that 
all  of  the  variation  of  the  x(£j)  about  the  true  solution  x(£)  can  be  attributed  to  the  measurement  errors 
£i . This  variation  is  surprisingly  small  in  view  of  the  large  condition  number  of  the  matrix,  but  is  still 
distressingly  large  when  viewed  as  relative  error  in  the  solution.  Many  of  the  ii  are  negative  even  though 
the  exact  solution  x(£)  is  everywhere  nonnegative. 

2.2.  Problem  II  - An  Underdetermined  Problem 

Although  the  matrix  in  Problem  I has  a large  condition  number,  the  least  squares  problem  is  better  than  most 
that  arise  from  first  kind  integral  equations.  Many  problems  have  rank  deficient  or  even  underdetermined 
matrices.  If  the  kernel  is  accessible,  then  it  is  always  possible  to  choose  n < m,  but  this  may  produce 
discretization  errors  larger  than  the  measuring  errors.  Even  if  discretization  error  is  acceptably  small, 
choosing  n < m may  still  generate  a rank-deficient  problem.  Many  real-world  problems  are  quite  naturally 
underdetermined,  with  m and  n being  fixed  by  hardware  considerations  or  other  physical  constraints.  Often 
the  analyst  is  given  not  the  kernel  functions  but  rather  the  matrix  K with  m < n.  To  simulate  such 

problems  we  discretized  the  Phillips  equation  with  m = 108  and  n — 121.  The  matrix  K has  rank  108,  and 
the  maximum  and  minimum  non-zero  singular  values  of  S~lK  are  <j\  — 3.3728  x 109  and  Cios  — 0.14635. 
The  bottom  frame  of  Figure  1 gives  a plot  of  the  generalized  inverse  (minimal  length  least  squares)  solution 

*=  {S-'K)'  S-'y  , (2.13) 

which  oscillates  about  the  true  solution  with  wider  variations  than  those  obtained  for  Problem  I.  This  higher 
noise  level  is  due  mostly  to  the  loss  of  information  in  using  fewer  observations.  Of  course  the  estimate  is  not 
unbiased,  but  the  oscillations  tend  to  be  centered  on  the  true  curve,  so  the  bias  is  evidently  small  relative 
to  the  random  scatter. 

2.3.  Window  Functions  and  Window  Vectors 

The  solutions  presented  in  the  preceding  section  were  highly  oscillatory  because  of  the  ill-conditioning  of  the 
regression  model.  These  oscillations  can  be  damped  to  some  extent  by  seeking  estimates  for  a set  of  average 
values  of  the  function  z(£)  as  in  (1.4).  The  simplest  set  of  window  functions  is 

M0  = 6(Z-h)  . * = l,2,...,n  , (2.14) 
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150  X 121  Phillips  Problem 
Classical  Least  Squares  Solution 


I 


108  X 121  Phillips  Problem 
Generalized  Inverse  Solution 


* 


Figure  1.  Classical  solutions.  Dashed  lines  are  exact  solutions  z(£)  and  solid  lines  are  estimates  ijt  = 
Top:  Problem  I,  least  squares  estimate.  Bottom:  Problem  II,  generalized  inverse  estimate. 


where  6{(,  — (k)  is  the  Dirac  delta  function  centered  at  the  point  In  this  case  the  corresponding  window 
responses  are 


<t>k 


= / = . 
J a 


(2.15) 


and  the  corresponding  window  vectors  (1.6)  are  just  the  columns  of  the  n-th  order  identity  matrix.  No  actual 
averaging  is  involved  and  the  estimates  k)  correspond  to  those  computed  in  the  previous  section. 

The  simplest  set  of  real  averages  is  defined  by 

k = ~ [ih+l  x{()dt  , k = 1,2, 1 , (2.16) 

where  = £k+i  — Each  of  these  non-overlapping  averages  is  combined  with  its  corresponding  ^-interval 
to  give 

V2  = { ([&,fc+i],&)  = k = l,2,...,n-l  } , (2.17) 

a histogram  approximation  to  z(£).  Using  the  trapezoidal  rule  to  discretize  the  integrals  (2.16)  gives  the 
window  vectors 


Wi 

= ( 0.5 

0.5 

0 

0 

0 

•••  0) 

w2 

= (° 

0.5 

0.5 

0 

0 

•••  0) 

W3 

- (0 

0 

0.5 

0.5 

0 

...  0) 

Wn-1 

= (0 

0 

0 

0 

0.5  0.5  ) 

(2.18) 


so  we  call  this  process  2-point  averaging. 

Averaging  can  be  extended  to  include  more  than  one  quadrature  panel  in  each  average,  ending  ultimately 
with  n-point  averaging  which  would  seek  to  estimate  only  the  average  value  of  a:(£)  on  the  whole  interval 
[o,  6].  In  general,  v-point  averaging  uses  window  functions 

•»*({)  = { • * = 1'2 p.  (2-i9> 

where  p is  the  largest  integer  less  than  or  equal  to  (n  - l)/(t/  — 1).  The  typical  window  vector  has  the  form 


Wife 


= ( 


°>  7^1-  ITT! I •••>  7=T>  2f7^TJ’  °>  •••>  °)  ’ (2-20) 

with  exactly  u non-zero  elements  beginning  with  element  number  ( k — l)(i/  — 1)  + 1.  The  corresponding 
histogram  approximation, 

'P v — { ([^(it-i)(i'-i)+i!^ifc(i/-i)+i])(^ife)  : k — 1)2, . . . , p | , (2.21) 

will  cover  the  whole  interval  [a,  6]  only  if  (n  — 1)  is  an  exact  multiple  of  ( i>  — 1).  Otherwise,  there  will  be 
one  or  more  (but  fewer  than  v)  quadrature  panels  that  will  not  be  included  in  the  set  of  averages. 


3.  Classical  Point  Estimation 

Consider  the  problem  of  estimating  the  linear  function 


<p  — wTx 

(3.1) 

subject  to  the  statistical  constraints 

y = Kx  + ~e  , £(e)  = o , £ (eeT)  = S 2 . 

(3.2) 

A linear  estimator 

<f>  = uTy  , 

(3.3) 

with  mean  and  variance 

£(<p)  — uT Kx  , Var(0)  — uT S3u  . 

(3.4) 

is  unbiased  if  £(<f>)  — (j)  = 

wT x identically  in  * [18,  Chapt.  1].  Thus  4>  is  unbiased  if  and  only  if 

utK  = wt  , 

(3.5) 

which  means  the  vector  w is  a linear  combination  of  the  rows  of  K. 
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3.1.  Full  Rank  Problems  - The  Best  Linear  Unbiased  Estimator 


If  rank(itT)  — n < m,  an  unbiased  estimator  exists  for  any  linear  function  wT  x.  In  fact,  there  will  be  many 
unbiased  estimators 

(3.6) 


T - 

u y 


wTK'  + zT  ( Im-KK t 


y 


where  K * is  the  generalized  inverse  of  K and  z is  any  nwector.  The  best  linear  unbiased  estimator  is  the 
one  that  minimizes  the  variance,  i.e.,  the  one  corresponding  to  the  vector  u that  solves  the  constrained 
optimization  problem 

Var(uTy)  = min  { uT  S3u  | uT K — wT  } . 


It  is  easy  to  see  that  the  solution  vector  is 

u=  S~3K(KT  S~3K)~'w  , 
so  the  best  linear  unbiased  estimator  is 

= uTy  = wT (KT  S~3 K)~1  Kt  S~3y 

This  estimator  can  also  be  written 


(/)  - wT  x , 


where 


x = {KT  S~3K)~'Kt S~3y 

is  the  solution  vector  for  the  weighted  least  squares  problem 

r2  =■  min  {(y  — Kx)TS~2(y  — Kx)}  . 

•E 

It  follows  from  (3.4)  and  (3.8)  that  the  minimum  variance  is 

Var(<^)  = uTS2u  = wT (KT  S~2 K)~lw 


(3.7) 

(3.8) 

(3.9) 

(3.10) 

(3.11) 

(3.12) 

(3.13) 


As  an  example,  consider  Problem  I.  If  the  set  of  window  vectors  is  taken  to  be  the  columns  of  In,  then 
the  <j) k are  just  the  elements  of  the  least  squares  solution  x as  in  the  top  frame  of  Figure  1.  The  estimates 
corresponding  to  the  window  vectors  (2.20)  for  2-point  averaging  are  shown  in  the  top  frame  of  Figure  2, 
and  the  estimates  for  3-point  averaging  are  shown  in  the  bottom  frame.  The  noise  level  has  been  reduced 
by  averaging,  but  this  improvement  is  achieved  only  at  the  expense  of  a reduction  in  resolution  for  the 
independent  variable  (.  Note  also  that  the  averaging  does  not  prevent  the  occurrence  of  some  negative 
estimates. 


3.2.  Underdetermined  Problems 


If  K has  less  than  full  column  rank,  then  the  set  of  all  least  squares  solutions  can  be  written 

*(*)  = (s-1*:)1  s~ly  + \Jn  - {s-'Ky  (S-1^) 


(3.14) 


where  z is  an  arbitrary  ra-vector.  An  unbiased  estimator  for  wT x exists  only  if  wT  is  an  exact  linear 
combination  of  the  rows  of  K,  in  which  case  the  best  linear  unbiased  estimate  is  given  by  the  u-vector 


u = S-1  [(S-1#)1 

T 

w , 

(3.15) 

so 

= uT y = wT  (S  1 K)  S 
where  x is  the  minimal  length  solution  (2.13).  The  variance 

1 - t - 

y = w x , 

of  this  estimator  is 

(3.16) 

Var(<£)  = uT S2u  = wT  (ktS  2 k\  w . 

(3.17) 
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Classical  Least  Squares,  2-Point  Averaging 


150  X 121  Phillips  Problem 


Classical  Least  Squares,  3 -Point  Averaging 


* 


Figure  2:  Best  linear  unbiased  estimates  for  Problem  I.  Dashed  lines  are  exact  solutions  z(£)  and  solid  lines 
are  histogram  estimates  <j>k. 

Top:  2-point  averages  over  intervals  [£i,£jfe+i].  Bottom  : 3-point  averages  over  intervals  + 
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Being  limited  to  window  vectors  expressible  as  linear  combinations  of  the  rows  of  K places  severe  restrictions 
on  the  information  that  can  be  elicited  about  the  unknown  function  x(£).  This  restriction  is  imposed  in 
the  method  of  Backus  and  Gilbert  [19,  Chapt.  7]  that  constructs  just  such  a set  of  window  vectors,  called 
averaging  kernels.  Ideally,  each  averaging  kernel  should  resemble  a narrow  Gaussian  centered  on  the  point 

In  practice  the  rows  of  K rarely  admit  good  resemblances  to  Gaussians,  so  the  estimates  are  usually 
non-symmetric  weighting  functions  that  change  from  one  £*  to  the  next.  The  resulting  set  of  point  estimates 
(£k,<t>k)  comprise  a smoothed  discrete  approximation  to  the  unknown  function  i(£).  This  smoothing  is 
difficult  to  characterize  and  non-uniform  over  the  range  of  £. 

If  w is  not  expressible  as  a linear  combination  of  rows  of  K , then  the  average  wT  x is  said  to  be 
inestimable.  This  does  not  mean  that  no  unbiased  estimate  exists,  but  rather  that  there  exists  no  unbiased 
estimator  of  the  form  uT y.  In  theory,  there  exist  unbiased  estimates  of  the  form  wT  x(z)  where  x (z)  is 
some  solution  (3.14)  of  the  underdetermined  least  squares  problem.  In  reality,  it  is  necessary  to  accept  the 
fact  that  computable  estimates  will  be  biased  and  to  try  to  assure  that  the  errors  due  to  bias  axe  small  in 
comparison  to  the  random  variations  introduced  by  the  e. 

In  Section  2 we  presented  the  generalized  inverse  solution  for  Problem  II.  It  is  apparent  from  the  bottom 
frame  of  Figure  1 that  any  bias  in  this  approximation  is  smaller  than  the  scatter  due  to  the  random  errors. 
In  Figure  3 we  give  the  histogram  approximations  corresponding  to  the  2-point  averaging  windows  and  the 
3-point  averaging  windows.  Even  though  the  averaging  considerably  reduces  the  scatter  due  to  the  random 
errors,  it  is  still  not  possible  to  detect  the  bias.  It  would  seem  that  the  bias  is  not  so  large  that  it  rules  out 
the  possibility  of  obtaining  useful  information  about  the  function  x(£). 


4.  Classical  One-at-a-Time  Confidence  Intervals 


For  any  unbiased  estimator  <j>  — uTy,  one-at-a-time  confidence  intervals  for  <f>  — wTx  can  be  constructed 


from  the  transformed  random  variable 

> n 


<t>-4> 


T A T — 

u1  y — w1  x 


\/Var  (<f>)  y/  uT  S3u 

where  the  latter  equality  follows  from  (3.4).  It  also  follows  from  (3.4)  and  (3.5)  that 

£(t])  = 0 , Var(7?)  = 1 . 


(4.1) 


For  a given  probability  a,  a 100a%  confidence  interval  for  4>  can  be  constructed  by  determining  a number  /c 
such  that 

Pr  { — k < r]  < +/c  } = a . 

Substituting  (4.1)  and  rearranging  gives 


(4.2) 
iber  k. 

(4.3) 


Pr  | | <j>  — /c\/Var(</>) 


< <f>  < 


<t>  + «\/Var($)  | j = a , 


or 


Pr  | ^ uTy  — k\/ut  S3u  j < wT  x < (^uTy  + k\/  ut  S3u  j j = a . 


Thus,  the  interval 


I(or,n)=  (yUT  y — k,\/ut  S3uj  , ^ uTy  + K\/uTS3u'j 


(4.4) 

(4.5) 

(4.6) 


is  a 100 a.%  confidence  interval  for  <f>. 

Given  the  value  k,  the  foregoing  results  are  valid  for  any  unbiased  estimator  uTy.  The  best  linear 
unbiased  estimator  (3.9)  gives  the  confidence  interval  J(a ; re)  = <j>up],  where 

2.lo  a T a * / * T c*2  a 

<t>  = u y - kVuS_u  , . . 

(f)uP  — uT y -)-  K.y/ uTS2u  . 

It  is  clear  from  Eqs.  (3.8),  (3.9)  and  (3.10)  that  these  bounds  can  also  be  written 


4>l°  = wTx  - k^/wt(KtS-3K)-'w  , 
p = WTX  + Ky/l VT{KTS-3K)~'w  , 


(4.8) 
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* 


108  X 121  Phillips  Problem 
Gen.  Inverse  Solution,  3-Point  Averaging 


* 


Figure  3:  Generalized  inverse  estimates  for  Problem  II.  Dashed  lines  are  the  exact  solution  x({)  and  solid 
lines  are  histogram  estimates  <£fc. 

Top:  2-point  averages  over  intervals  [£*, Bottom:  3-point  averages  over  intervals  [^2fc  — l , fat  + i]- 
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where  x is  the  least  squares  solution  vector  (3.11). 

The  width  of  the  confidence  interval  (4.6)  is  directly  proportional  to  the  value  of  k.  Optimally  narrow 
confidence  intervals  can  be  computed  only  if  the  probability  density  function  for  77  is  known,  and  this  requires 
a knowledge  of  the  joint  probability  density  function  for  the  vector  e. 

4.1.  Confidence  Intervals  for  Normally  Distributed  Errors 

In  many  applications,  e is  known  (or  assumed)  to  have  a multivariate  normal  distribution,  e ~ N (o,S2), 
so  77  follows  the  standard  normal  distribution,  77  ~ n(0, 1).  This  means  that  for  any  a (0  < a < 1),  it  is 
possible  to  find  a corresponding  value  k (0  < k < 00)  satisfying 

WrL  “P  (-?)*'  = “ ' 

which  defines  the  relation  between  the  /c  and  the  a required  in  (4.4). 

As  an  example,  consider  Problem  I and  the  best  linear  unbiased  estimators  for  the  2-point  averaging 
window  vectors  (2.18).  Using  k — 1.960,  for  95%  intervals,  Equations  (4.8)  give  the  confidence  bounds 
plotted  as  solid  lines  in  the  top  frame  of  Figure  4.  There  are  120  window  vectors  in  the  whole  set,  so  we 
expect  approximately  6 of  the  confidence  intervals  to  fail  to  bracket  the  corresponding  true  averages.  In  this 
particular  problem  there  were  9 bracket  failures. 

4.2.  Confidence  Intervals  From  Chebyshev’s  Inequality 

If  the  joint  probability  density  function  for  e is  not  known,  then  valid,  suboptimal  confidence  intervals  can 
be  constructed  by  using  Chebyshev’s  inequality  [9,  Chapt.  1],  which  guarantees  that,  for  any  k > 0, 


Pr  { | 77  - 5(77)  | > KVar(77)  } < \ . (4.10) 

It  follows  then  that,  for  any  unbiased  estimator  uTy , 

Pr  | | uTy  — wT x | < kV ut  S3u  j > 1 ^ . (4-11) 

Therefore,  for  any  a (0  < a < 1),  if 

K = > (412) 

then 

Pr  | ^ uTy  — k\/  uT  S 3 u^j  < wTx  < ^ uT  y -f  re\/ uT  S3u^j  | > a , (4-13) 

so  the  interval  (4.6)  is  a 100a%  confidence  interval  for  <f>. 


Confidence  intervals  calculated  from  Chebyshev’s  inequality  are  very  conservative  since  they  must  be  wide 
enough  to  accommodate  any  possible  probability  density  for  77.  Table  1 compares  the  sizes  of  Chebyshev 
intervals  and  normal  distribution  intervals  for  some  commonly  used  confidence  levels.  The  ratio  is  the  factor 


(4.9) 


Table  1:  fc-values  for  Chebyshev  inequality  and  standard  normal  distribution 


a 

/c-Chebyshev 

/c-normal 

ratio 

0.6667 

1.732 

0.967 

1.79 

0.95 

4.472 

1.960 

2.28 

0.99 

10.000 

2.575 

3.88 

0.999 

31.622 

3.295 

9.60 

by  which  the  interval  width  must  be  expanded  if  the  form  of  the  error  distribution  is  not  known. 

As  an  example,  consider  again  Problem  I and  the  best  linear  unbiased  estimators  for  the  2-point  averaging 
window  vectors.  The  bottom  frame  of  Figure  4 is  a plot  of  the  95%  one-at-a-time  confidence  interval 
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150  X 121  Phillips  Problem 
One-at-a-time  Confidence  Intervals 
Using  Chebyshev  Theorem,  a = 0.95 
Classical  Least  Squares,  2-Point  Averaging 


£ 

Figure  4:  Classical  95%  one-at-a^time  confidence  intervals  for  Problem  I.  Dashed  lines  are  exact  solution 
x(£),  and  solid  lines  are  histograms  of  estimated  lower  and  upper  bounds. 

Top:  2-point  averages  from  normality  assumption.  Bottom:  2-point  averages  from  Chebyshev’s  Inequality. 
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bounds  obtained  from  Chebyshev’s  inequality  for  those  averages.  The  solid  lines  are  histograms  of  the 
upper  and  lower  bounds  calculated  from  (4.8)  using  re  = 4.472.  These  intervals  are  larger  by  a factor  of 
4.472/1.960  sa  2.28  than  the  corresponding  intervals  (in  the  top  frame)  obtained  by  assuming  normally 
distributed  errors.  The  dashed  line  is  the  true  solution.  Note  that  none  of  the  true  values  of  the  averages 
lies  outside  the  corresponding  confidence  interval. 


5.  The  Geometry  of  Confidence  Interval  Estimation 

5.1.  Full  Rank  Problems 

No  matter  what  the  relation  between  a and  re,  the  corresponding  confidence  interval  bounds  for  the  best 
linear  unbiased  estimator  (3.9)  can  be  calculated  from  either  Eq.  (4.7)  or  (4.8).  It  is  not  difficult  to  show 
that  the  end-points  of  this  interval  are  also  defined  by  the  two  constrained  estimation  problems: 


<j>l°  — min  { wT x | ( y — Kx)TS  2(y  — Kx)  = f2  + re2  } , 
<^up  — max{io:ra;  | (y  — Kx)T  S~2(y  — Kx)  = f2  + re2  } , 


(5.1) 


where  f2  is  the  minimum  sum  of  squares  (3.12).  These  problems  can  be  solved  by  Lagrange  multipliers  to 
give  solution  vectors 


• lo 


• up 


= x — 


- , 

= X + 


\/wT(KT  S~3K)~1w 
re 

s/wt{Kt  S~3K)~'w 


(. KTS~3K ) 1 w , 
{KtS-*K)~'w  , 


(5.2) 

(5.3) 


and  values  for  cf>l°  and  <f>up  that  are  the  same  as  the  values  given  by  (4.8). 

The  common  constraint  region  for  problems  (5.1),  i.e., 

<S(re)  = {x  | (y  - Kx)TS~2(y  - Kx)  = f2  + re2  } , 

can  also  be  written  [17,  Chapt.  2] 

S(k ) — | 35  I (*  — x)T  KtS~3K  (x  — x)  — re2  | , 


(5.4) 


(5.5) 


which  defines  the  surface  of  an  ellipsoid  centered  on  the  least  squares  solution  x.  The  size  of  this  ellipsoid 
varies  with  re.  For  any  value  of  the  parameter  </>,  the  set  of  points 


H(w,<l>)  = { 


T~  = 4>} 


w x 


(5.6) 


forms  a hyperplane  orthogonal  to  the  vector  w.  Therefore,  the  bounds  and  <^up  are  the  values  of  0 on 
the  two  tangential  support  planes  of  the  ellipsoid  <5(re)  that  are  orthogonal  to  w. 

The  location  of  the  center  of  the  ellipsoid  <S(re)  depends  on  y,  and  its  size  is  scaled  by  the  value  of  re, 
but  its  orientation  and  shape  are  completely  determined  by  the  matrix  K1-  S~3 K . Let  the  singular  value 
decomposition  of  S'-1  If  be 

S~'K  = U ( 5 1 VT  , (5.7) 

(5.8) 


with 


UTU  = Im  = UUT  , VTV  = In  = VVT  , 27  = diag(<rlf<r3,...,an)  , 

with  (7 1 > cti  > • • • > crn.  It  follows  then  that 

KtS~3K  = VSaVT  . 


(5.9) 


It  is  easy  to  see  [17,  Chapt.  2]  that  the  mutually  orthogonal  vectors  «i,  V2,  ■ ■ ■ , vn  define  the  directions  of 
the  major  axes  of  *S(re),  and  the  lengths  of  those  axes  are  given  by 


2re 

4j  — , 1=1,2,.  ..,71 

CTi 


(5.10) 
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1 

2 

- 

4>up  - - 2k 

Y- 

a2 

{VIW) 

— 

[;=i  aJ 

J = l 

From  (4.8),  (5.9)  and  the  definition  of  S,  it  follows  that  for  any  window  vector  w,  the  corresponding  a-level 
confidence  interval  has  width 


(5.11) 


Since  Vj  is  a unit  vector  (i.e.,  vjvj  = 1),  the  scalar  wTvj  is  just  the  projection  of  w on  the  jth  major  axis 
of  the  ellipsoid,  and  ij  is  the  length  of  that  major  axis. 

Linear  regression  models  obtained  from  first  kind  integral  equations  are  almost  always  poorly  conditioned. 
The  major  axes  of  S(k ) corresponding  to  the  smaller  singular  values  are,  therefore,  usually  greatly  elongated. 
From  (5.11)  it  follows  that  if  w has  a non-zero  projection  on  any  of  the  singular  vectors  corresponding  to  the 
smaller  singular  values,  then  the  confidence  interval  for  wT x will  be  very  wide.  Almost  every  window  vector 
designed  to  elicit  information  about  the  function  *(£)  will  have  non-zero  components  in  the  directions  of  these 
elongated  axes.  In  the  extreme  case,  one  or  more  of  the  singular  values  are  zero  and  the  corresponding  axes 
of<S(/c)  are  infinite  in  length,  yielding  infinite  intervals.  When  such  a problem  is  read  into  a computer,  the 
machine  truncation  errors  almost  always  produce  a full  rank  matrix  on  the  machine,  but  there  is  no  danger 
of  obtaining  deceptively  good  confidence  intervals  as  a result.  If  cm  is  the  machine  epsilon,  then  typically  for 
a fa  whose  true  value  is  of  order  unity,  the  classical  estimates  (4.7)  give  intervals  like  [— 0(e  M ) i +^(eM1)] 
which,  though  shorter  than  [— oo  , +oo],  are  nevertheless  useless. 

5.2.  Under  determined  Problems 

For  problems  with  m < n,  the  matrix  is  exactly  rank  deficient  with  <rm+i  = crm+2  = • • • = crn  = 0.  The 
ellipsoid  S(k)  has  at  least  n—  m infinitely  long  major  axes,  and  its  center  is  not  a single  point  x but  rather 
a coset  of  the  subspace  spanned  by  the  singular  vectors  rm+i,  vm-t_2i  • • • > vn,  be.,  the  set  of  points  defined 
by  (3.14).  In  that  expression,  the  set  of  points 


Af(S-'K)  = { [in  - (5-1AT)t  {S-'K)] 


z arbitrary  j 


(5.12) 


is  the  subspace,  and 

x = S^y  (5.13) 

is  the  displacement  of  the  coset  from  the  origin.  If  w is  orthogonal  to  the  subspace  Af(S~1  K),  then  it  is  an 
exact  linear  combination  of  the  rows  of  K , so  the  best  linear  unbiased  estimator  for  wT x is  given  by  (3.16), 
and  its  variance  is  given  by  (3.17).  The  confidence  interval  bounds  are 


<f>l°  = wTx  - Ky/wT(KTS-3Kyw  , 
= wTx  + k^wt{Kt S~3Kyw  , 


(5.14) 


where  x is  the  minimal  length  solution  (5.13). 

If  w has  a non-zero  projection  on  any  of  the  infinite  axes  of  S(i c),  then  wTx  is  said  to  be  inestimable 
because  there  exists  no  unbiased  estimator  of  the  form  <j>  — uT y.  The  center  coset  does  contain  at  least  one 
point 

(5.15) 


x{z)  = (S'1*:)'  S~ly  + Jn  - ( S XK ) 


for  which  wTx(z ) gives  an  unbiased  estimate.  Taking  z — x would  give  such  an  estimate,  but  of  course  x is 
unknown.  Even  if  it  were  possible  to  find  some  z which  gives  an  unbiased  estimate,  the  confidence  interval 
for  wT  z would  still  be  unbounded. 

6.  Classical  Simultaneous  Confidence  Intervals 


The  method  for  estimating  a set  of  a-level  simultaneous  confidence  intervals  for  a whole  set  of  window  vectors 
{tOu  w3, . . . ,wp}  is  based  on  an  a-level  confidence  ellipsoid  for  the  unknown  vector  x.  For  any  positive 
value  7^,  the  region  in  x-space  defined  by 


«S(7n)  = { * I (*  - x)T  KTS  3K  (x  - i)  = 7^  } 


(6.1) 


15 


is  an  ellipsoid  centered  on  the  least  squares  solution  vector  x (3.11).  It  becomes  an  a-level  confidence  ellipsoid 
for  x if  7^  is  determined  so  that 


Pr  | (a;  — x)T  KT  S 3 K (x  — x)  < 7^  j = a . 


(6.2) 


The  required  set  of  simultaneous  confidence  intervals  is  constructed  from  the  support  planes  for  this  ellipsoid. 


In  particular,  the  set  of  intervals  Jfc(a;7^)  = , 4> 


, defined  by 


tfil°  — min  {tufa;  | (a:  — x)T  KT  S~3K  (x  — x)  = 7^  | , 
— rrmx|tr^x  | (x  — x)T  KTS~3K  (x  — x ) — 7^  j 


i ^ — 1 1 2, . . . , p , 


(6.3) 


satisfy 


Pr  { 4>‘k  < 4>k  < , fc=l,2,...,p}>a  , 


(6.4) 


and  this  result  holds  for  any  number  of  averaging  vectors.  These  simultaneous  confidence  intervals  are  defined 
by  the  support  planes  of  an  ellipsoid  in  the  same  way  as  are  the  one-at-a-time  intervals  given  by  (5.1).  The 
only  difference  is  in  the  size  of  the  defining  ellipsoid.  It  is  easy  to  see  that  if  rank(AT)  = n < m,  then  the 
solutions  to  the  constrained  optimization  problems  (6.3)  are 


<t>k  = nyJwl(KTS  3K)  1 wk  , 1 

A / / > k — 1 > 2, . . . , p 

- ^**  + 7 n^Jwl(KT S~3K)-1wk  J 

where  x is  the  least  squares  solution  vector  (3.11). 


(6.5) 


6.1.  Simultaneous  Intervals  for  Normally  Distributed  Errors 
For  the  case  where  e is  normally  distributed  and  K has  full  column  rank, 


so 


) ~ X2(rc)  , 

(6.6) 

} = J X2(n\q)dq  , 

(6.7) 

where  X2{n\q)  is  the  probability  density  function  for  the  x2(n)  distribution.  Thus,  for  any  probability  a,  if 
7^  is  defined  by 


/ X2{n;q)dq  = a , 
Jo 


(6.8) 


then  the  corresponding  ellipsoid  S(yn)  will  be  a 100a%  confidence  ellipsoid.  Most  statistics  handbooks 
tabulate  7^  as  a function  of  a and  n for  values  n < 30  (cf.  [1],  Chapt.  V.l).  For  larger  values  of  n,  the 
random  variable 

v=  y/2^  (6.9) 

follows,  to  a good  approximation,  a standard  normal  distribution,  so  for  a given  a,  the  corresponding  7^  can 
be  computed  by 


72  = \ [*«  + V2n-  l]2 
where  Aa  is  the  a-point  of  the  standard  normal  distribution,  i.e., 


n(0 , 1 ; 77)  dt)  — a 

■ OO 


(6.10) 


(6.11) 
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Table  2:  The  ratio  jn/^  for  normally  distributed  errors 


a 

n = 20 

n = 30 

n = 50 

n = 70 

n = 100 

n = 150 

n = 200 

0.6667 

4.9 

5.9 

7.6 

8.9 

10.6 

13.0 

14.9 

0.95 

2.9 

3.4 

4.2 

4.9 

5.7 

6.8 

7.8 

0.99 

2.4 

2.8 

3.4 

3.9 

4.5 

5.4 

6.1 

0.999 

2.0 

2.3 

2.8 

3.2 

3.7 

4.4 

4.9 

It  is  interesting  to  compare  the  widths  of  the  simultaneous  intervals  (6.5)  with  those  of  the  one-at-a-time 
intervals  (4.7).  The  ratio  of  the  corresponding  interval  widths  is 


simultaneous  interval  width  7 n 

single-dimensional  interval  width  k 


(6.12) 


This  quantity  is  tabulated  in  Table  2 for  several  values  of  a and  n. 

As  an  example,  we  again  consider  Problem  I with  2-point  averaging  windows  and  seek  95%  simultaneous 
intervals.  The  95%-point  for  the  standard  normal  distribution  is  Aa  = 1.645  which,  by  Eq.  (6.10),  gives 
7^  = 12.14.  Using  this  value  in  (6.5)  gives  the  intervals  shown  in  the  top  frame  of  Figure  5.  Comparing 
this  plot  with  the  one  in  the  top  frame  of  Figure  4 gives  a rough  idea  of  the  price  that  must  be  paid  for  the 
additional  certainty  of  simultaneous  intervals. 


6.2.  Simultaneous  Intervals  from  Chebyshev’s  Inequality 

If  the  form  of  the  joint  probability  density  function  for  e is  unknown,  then  valid,  though  conservative, 
confidence  ellipsoids  can  be  constructed  from  Chebyshev’s  inequality.  For  a given  a,  it  is  necessary  to 
determine  a corresponding  jn  such  that 

Pr  {*  6 S(7n)  } = Pr  j (*  - x)T  VS3 V7  (x  - x)  < 7^  } > a . (6.13) 


A multivariate  Chebyshev  inequality  given  by  Olkin  and  Pratt  [13]  ensures  that 
will  be  satisfied  if 

n 


7n  = r. • 

V 1 — OL 


for  any  a,  (0  < a < 1),  this 


(6.14) 


As  an  example  we  consider  the  simultaneous  95%  confidence  intervals  corresponding  to  a set  of  5-point 
averaging  windows  for  Problem  I.  The  window  vectors  are  given  by  (2.20)  with  v — 5.  For  n = 121  and 
a = 0.95,  the  corresponding  7n  is  7n  = 541.1.  The  confidence  bounds  are  plotted  as  histograms  in  the  bottom 
frame  of  Figure  5.  In  spite  of  the  high  degree  of  averaging,  the  demand  for  the  certainty  of  simultaneous 
intervals  coupled  with  a lack  of  knowledge  about  the  distribution  of  the  errors  has  produced  intervals  that 
are  almost  useless. 


7.  Nonnegatively  Constrained  Estimation 

The  least  squares  problem  arising  from  the  discretization  of  a system  of  first  kind  integral  equations  is  usually 
poorly  conditioned  with  respect  to  small  variations  in  the  measured  vector  y.  Quite  often,  the  exact  matrix 
K is  rank  deficient  which  means,  in  the  classical  theory,  that  the  linear  functions  cj> * = w^x  may  not  even 
be  estimable.  Fortunately,  in  cases  where  the  solution  x(£)  of  the  integral  equations  is  physically  constrained 
to  be  nonnegative,  the  imposition  of  those  constraints  on  the  solution  vector  x almost  always  stabilizes  the 
estimation  problems  and  gives  biased,  but  bounded,  physically  plausible  solutions,  no  matter  what  the  value 
of  rank(Ff).  Therefore  we  append  the  constraint  x > o to  the  linear  estimation  model  (1.1),  (1.2),  i.e. , 

y = Kx  + e , £(e)  — o , £ (^eeT j = S2  , Pr  {x  > 0}  = 1.0  . (7-1) 
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Classical  Least  Squares,  2-Point  Averaging 
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150  X 121  Phillips  Problem 
Simultaneous  Confidence  Intervals 
Using  Chebyshev  Theorem,  a = 0.95 
Classical  Least  Squares,  5-Point  Averaging 


* 


Figure  5:  Classical  95%  simultaneous  confidence  intervals  for  Problem  I.  Dashed  lines  are  the  exact  solution 
2(0>  and  solid  lines  are  histograms  of  estimated  lower  and  upper  bounds. 

Top:  2-point  averages  from  normality  assumption.  Bottom:  5-point  averages  from  Chebyshev’s  Inequality. 
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7.1.  Constrained  Point  Estimation 

When  the  nonnegativity  constraint  is  appended,  the  point  estimation  problem  becomes 

Pmin  = min  {(Kx  — y)T  S~2(Kx  — y)}  = r2  + min  {(x  - x)T KT S'3 K(x  - x)}  , (7.2) 

*>o  x>o 

with  f2  defined  by  (3.12).  If  x denotes  the  solution  vector,  i.e. , if 

Pmin  = (Kx-y)T S-\Kx-y)  , (7.3) 

then  estimates  of  the  ensemble  of  averages  can  be  computed  by 

<t>k=wlx,  Jc  - l,2,...,p  . (7.4) 

Equation  (7.2)  defines  a nonnegatively  constrained  least  squares  problem  for  which  there  is  no  explicit 
closed  form  solution,  but  good  algorithms  for  the  numerical  solution  have  been  available  since  the  mid-1970s 
[10].  Figure  6 gives  plots  of  these  solutions  for  Problems  I and  II.  Comparison  of  these  plots  with  those  in 
Figure  1 shows  that  the  nonnegativity  constraints  gave  good,  but  not  spectacular,  improvement.  For  most 
real-world  ill-posed  problems,  the  improvement  is  much  more  dramatic. 


7.2.  Constrained  Interval  Estimation 


The  classical  method  for  estimating  confidence  intervals  can  also  be  extended  to  take  the  nonnegativity 
constraints  into  account.  For  both  one-at-a-time  and  simultaneous  intervals,  the  classical,  unconstrained 
interval  estimation  problems  can  be  written  in  the  form 


4>l°  — min  { wTx  \ ( y — Kx)TS  2(y—Kx)=/32}  , 
4>up  = max  { wTx  | (y  — Kx)T  S~2(y  — Kx)  — (32  } , 

X 


where  /32  is  a constant  chosen  to  give  the  desired  confidence  level  a.  We  will  show  in  the  following  that  valid 
nonnegatively  constrained  confidence  intervals  can  be  obtained  by  solving  problems  of  the  form 


tj>l°  = min  { wT  x 

X 

$up  = max  { wT  x 


(y-Kx)TS-2(y-Kx)  = y.2  , 

(y-Kx)TS-2(y-Kx)  = n2 


x > o } 

X > o } 


(7.6) 


where  y,  = j3  for  simultaneous  confidence  intervals,  but  p differs  from  (3  for  one-at-a-time  intervals. 

The  common  constraint  region  in  problems  (7.6)  is  the  intersection  of  the  n -dimensional  /i-ellipsoid  with 
the  positive  orthant.  For  any  reasonable  discretization  of  the  integral  equations,  the  value  of  n will  be 
large  enough  to  assure  that  the  positive  orthant  constitutes  a very  small  fraction  of  n-space.  The  resulting 
intersection  is  generally  much  smaller  in  all  directions  than  is  the  /^-ellipsoid  itself. 

Even  for  rank  deficient  matrices  K this  intersection  is  almost  always  bounded.  If  K is  rank  deficient, 
then  one  or  more  of  the  smallest  eigenvalues  of  K 'T  S~3  K will  be  zero,  and  the  ellipsoid  will  be  unbounded 
in  the  directions  of  the  corresponding  eigenvectors.  The  intersection  of  the  ellipsoid  with  the  positive  orthant 
will,  however,  be  unbounded  only  if  at  least  one  of  these  degenerate  eigenvectors  lies  in  the  positive  orthant. 
Since  a;-space  has  2”  orthants  and  KT  S~3K  has  fewer  than  n degenerate  eigenvalues,  it  will  be  a very  rare 
occurrence  for  one  of  the  corresponding  eigenvectors  to  extend  into  the  positive  orthant.  In  many  physical 
applications  the  matrix  K is  nonnegative,  and  in  such  cases,  * > o,  x ^ o =>  KT S~3 Kx  ± 0 • x\  thus, 
for  these  applications,  no  degenerate  eigenvector  could  lie  in  the  positive  orthant. 

It  is  not  possible  to  write  closed  form  solutions  for  the  problems  (7.6),  but  numerical  calculation  of  the 
solutions  is  possible.  The  first  person  to  employ  nonnegativity  constraints  in  order  to  reduce  the  size  of 
estimated  confidence  intervals  was  Walter  R.  Burrus  [2]  who  used  the  technique  in  neutron  and  gamma-ray 
spectrum  unfolding  problems.  Burrus  did  not  give  an  algorithm  for  solving  (7.6),  but  he  and  his  colleagues 
developed  several  algorithms  [2],  [3],  [4],  [15],  [16]  for  computing  suboptimal  interval  approximations  to 
l4>l°  ,(f>up]-  In  spite  of  the  suboptimality,  the  intervals  thus  obtained  were  uniformly  much  better  than 
the  classical  intervals  (4.7).  O’Leary  and  Rust  [12]  have  developed  an  algorithm  called  BRACKET-LS  for 
computing  the  optimal  intervals.  This  algorithm  was  used  to  compute  the  confidence  intervals  given  in  the 
remainder  of  this  paper. 
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108  X 121  Phillips  Problem 
Nonnegatively  Constrained  Least  Squares  Solution 


£ 

Figure  6:  Nonnegatively  constrained  solutions  to  the  test  problems.  Dashed  lines  are  exact  solutions  x(£), 
and  solid  lines  are  least  squares  estimates  = e^x. 

Top:  Problem  I.  Bottom:  Problem  II. 
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7.3.  Simultaneous  Confidence  Intervals 


The  extension  of  the  classical  theory  to  accommodate  nonnegativity  constraints  in  the  estimation  of  one-at- 
a-time  intervals  is  complicated,  so  we  defer  it  to  the  next  section.  The  extension  to  estimating  simultaneous 
intervals  is  straightforward  because  the  intersection  of  a confidence  ellipsoid  with  the  positive  orthant  is  itself 
a confidence  region  with  the  same  confidence  level  as  the  ellipsoid.  Let  7 be  chosen  so  that 


•5(7)  = {*1  {y-Kx)TS  2(y  - Kx)  < f2  + j2  } 

= j x | (a;  — x)T  KtS~3K  (x  — x)  < 72  j ’ 

is  a 100q%  confidence  ellipsoid  for  x,  and  define  events  A and  B by 

A = {(y  - Kx)TS~2(y  - Kx)  < f2  + 72}  , B = {x  > 0}  . 
These  events  have  probabilities 

P(A)  = a,  P(B)  = 1.0  , P(AuS)=1.0, 

so  it  follows  from  the  elementary  identity, 

P(A  nB)  = P(A)  + P(B)  - P(A  U B)  , 

that 

Pr  { (y  — Kx)t  S~2(y  — Kx)  < f2  -f-  72  , ® > 0 } = a . 

Therefore,  if 

if>l°  — min  { w^x  | (y  — Kx)T  S~2(y  — Kx)  < r2  + 72  , x > 0}  , 

^p  = max{ic^aj  | (y  — Kx)T S~2(y  — Kx)  < f2  + 72  , x > 0}  , 

for  k = 1,  2, . . . , p,  then  it  follows  that 

Pr{  4>k  < wlx  < ft?  , i = l,2 pj  = a , 


(7.7) 

(7.8) 

(7.9) 

(7.10) 

(7.11) 

(7.12) 

(7.13) 


and  this  result  holds  for  any  number  p of  window  vectors. 

In  Section  6.1  we  saw  that  for  full-rank  problems  with  normally  distributed  errors,  a 100a%  confidence 
ellipsoid  is  obtained  by  choosing  72  to  be  the  a-point  of  the  X2(n)  distribution  [cf.  Eq.  (6.8)].  With  the  ad- 
dition of  nonnegativity  constraints,  it  becomes  possible  to  estimate  confidence  intervals  for  underdetermined 
problems,  but  the  choice  of  72  is  different.  Rust  and  Burrus  [17,  Chapt.  6]  proved  that  if  y is  normally 
distributed,  and 

. x = {S-'K)' S~ly  = (IC1  S~*K)]  KT S~3y  , (7.14) 

is  the  corresponding  generalized  inverse,  least  squares  solution  vector,  then 

(x  — x)T KT S~3 K(x  — x)  ~ X2(u)  i (7.15) 


where  u — rank(liC).  Therefore,  if  7 is  taken  to  be  the  a-point  of  the  x2(l/)  distribution,  then  Eqs.  (7.12) 
define  the  confidence  interval  which  can  be  computed  with  BRACKET-LS  algorithm  [12].  Determining  the 
exact  rank  v is  often  a difficult  and  uncertain  procedure,  but  it  is  better  to  overestimate  the  rank  than  to 
underestimate  it.  In  the  former  case,  the  confidence  intervals  will  be  suboptimal  (larger  than  necessary  to 
assure  the  confidence  level),  but  in  the  latter  case  they  will  be  dishonest  (too  small  to  assure  the  claimed 
confidence  level).  A conservative  procedure  is  to  choose  v — min{m,  n}. 

Figure  7 gives  plots  of  the  simultaneous  confidence  intervals  computed  by  BRACKET-LS  for  Problem  I 
using  2-point  averaging,  and  Problem  II  using  3-point  averaging,  when  the  measuring  errors  are  normally 
distributed.  The  bounds  for  Problem  I can  be  compared  with  those  in  the  top  frame  of  Figure  5 to  assess 
the  improvement  obtained  by  using  the  nonnegativity  constraints.  The  improvement  is  even  more  dramatic 
for  Problem  II  because  the  intervals  are  all  unbounded  for  the  unconstrained  problem.  The  nonnegativity 
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Figure  7:  Nonnegatively  constrained,  simultaneous  95%  confidence  intervals  from  the  normal  distribution. 
Dashed  lines  are  the  exact  solutions  a:(£),  and  solid  lines  are  histograms  of  estimated  lower  and  upper  bounds. 
Top:  Problem  I,  2-point  averages.  Bottom:  Problem  II,  3-point  averages. 
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constraints  make  it  possible  to  compute  useful  confidence  intervals  even  though  the  underlying  linear  model 
is  underdetermined. 

When  the  form  of  the  error  distribution  is  unknown,  the  procedure  is  the  same  except  7 is  given  by 
(6.14).  Figure  8 gives  plots  of  the  bounds  obtained  for  both  Problems  I and  II  using  5-point  averaging. 
Even  with  this  higher  degree  of  averaging,  the  bounds  are  poor  in  comparison  with  those  in  Figure  7,  but 
they  are  considerably  stronger  than  those  in  the  bottom  frame  of  Figure  5 which  were  obtained  without  the 
nonnegativity  constraints. 


8.  Constrained  One-at-a-Time  Confidence  Intervals 

8.1.  The  Burrus  Conjecture 

Nonnegatively  constrained  one-at-a-time  confidence  intervals  cannot  be  defined  in  the  same  manner  as  con- 
strained simultaneous  intervals.  More  precisely,  if  k is  defined  by  (4.9)  or  by  (4.12),  then  valid  100a% 
one-at-a-time  intervals  cannot  be  calculated  from  Equations  (7.6)  with  p2  = f 2 + k2.  In  fact,  it  almost 
always  happens  that 

f2  + K2  < pmin  - min  {(y  - Kx)TS~2(y  - Kx)}  , (8.1) 

i.e.,  the  ellipsoid  (5.4)  had  no  points  in  common  with  the  positive  orthant.  This  difficulty  arises  because  the 
regression  models  almost  never  have  m >>  n.  Accordingly,  the  classical  least  squares  procedure  produces  a 
solution  vector  x which  models  a large  portion  of  the  measurement  errors  e and  hence  gives  an  unrealistically 
low  value  for  f2.  This  fact  was  pointed  out  by  Walter  R.  Burrus  [2]  who  also  noted  that  the  ellipsoid  defined 
by 

(y  - Kx)T S~2(y  - Kx)  < Pmin  + «2  (8.2) 

always  has  a non-empty  intersection  with  the  positive  orthant.  He  conjectured  [17,  Chapt.  6]  that  valid 
confidence  intervals  could  be  obtained  by  using  that  intersection  as  the  constraint  region,  i.e.,  by  solving  the 
problems 

<pl°  - nun  { wT x | (y  — Kx)T  S~2(y  - Kx)  — p^n  + /c2  , x > o } , 

<f>up  - max  { wTx  | (y  - K x )T S~2(y  - Kx)  = Pmin  + k2  , x > o } . 

X 

We  shall  show  in  the  following  that  this  conjecture  is  indeed  correct. 


8.2.  The  Duality  Theorem 

In  order  to  prove  the  Burrus  Conjecture,  it  will  be  necessary  to  restate  problems  (8.3)  in  equivalent  forms 
defined  by  the  Duality  Theorem  for  Nonlinear  Programming  [20]: 

Wolfe’s  Duality  Theorem.  Suppose  z is  an  iV-vector  of  unknown  variables,  /(z)  is  a scalar 
function  and  y(z)  is  an  M-vector  function  of  z,  and  that  /(z)  and  y(z)  are  all  convex  and 
differentiable  on  an  open  set  Z.  If 

4>P  - min  {/(z)  | g(z)>o},  (8.4) 

and  the  constraints  g(z)  > o satisfy  a constraint  qualification,  then 

<t>p  = <t>D  , (8-5) 

where  $£>  is  determined  by  the  dual  problem 

4>d  = {max  j/(z)  - gTv  | V/(z)  = v > 0 j • (8-6) 

Equation  (8.4)  is  called  the  primal  problem.  The  constraint  qualification  can  be  any  one  of  several  regularity 
conditions  on  either  the  convexity  or  differentiability  properties  of  the  functions  ft(z).  More  details  on  such 
conditions  can  be  found  in  [11,  Chapts.  5,  7]. 
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Figure  8:  Nonnegatively  constrained,  simultaneous  95%  confidence  intervals  from  Chebyshev’s  Theorem. 
Dashed  lines  are  exact  solutions  z(£),  and  solid  lines  are  histograms  of  estimated  lower  and  upper  bounds. 
Top:  Problem  I,  5-point  averages.  Bottom:  Problem  II,  5-point  averages. 
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It  is  instructive  to  apply  this  theorem  to  the  classical,  full-rank  interval  estimation  problems  (5.1).  To 
simplify  the  notation,  let 

/i  = +\/f2  + k2  , (8.7) 

and  consider  first  the  lower  bound  problem  which  can  also  be  written  with  inequality  constraints, 

<j>l°  = min  { wTx  \ (y  — K x)T S~2 (y  — K x)  < y?  } , (8-8) 

because  the  indicated  minimum  is  attained  on  the  boundary  of  the  ellipsoid.  Many  equivalent  forms  of  the 
dual  problem  can  be  derived,  but  it  is  easiest  to  get  the  form  we  need  by  using  the  artifice  of  writing  the 
problem  in  terms  of  the  original  variables  x and  the  scaled  residual  vector 


r = S 1(y  - Kx). 

The  vector  r can  be  specified  by  two  vector  inequalities, 

r — S-1(y  — Kx)  > o , — r + S_1(y  — Kx)  > o . 
Reformulating  (8.8)  in  terms  of  the  ( N — m + n)  variables 

z^(*T,rT)T  , 

gives 

/ -S-1(y  - Kx)  + r \ / o \ 

g{z)=  +S~1{y  - Kx)  - r > o 

\ M - \/rTr  ) \ o / 

for  the  constraints  and 

f(z)  = (wT  , oT  ) z 

for  the  objective  function.  The  problem  then  becomes 

4>l°  = 4>P  = min  { f(z)  | g(z)>o}  , 

Z&Z 


(8.9) 

(8.10) 

(8.11) 

. (8.12) 

(8.13) 

(8.14) 


where  Z — 1ZN . The  functions  f(z)  and  g(z)  are  convex  and  differentiable  everywhere,  and  the  constraints 
g(z)  > o satisfy  Slater’s  constraint  qualification  (cf.  [11,  §5.4.3]),  so  the  Duality  Theorem  is  applicable. 
Therefore 

4>l°  = <t>D  = max^  j/(z)  - gT v | V/(z)  = v , v > ° | , (8-15) 

where  v is  a ( M = 2m  + l)-vector  of  dual  variables  that  can  also  be  written 


T f T T \T 

V = ( v{  , t>£  , V3  ) 

with  subvectors  v\  and  v 2 of  length  m. 

The  derivatives  required  for  the  dual  constraints  are 


so  the  constraints  themselves  can  be  written 

iTTS-1(u  1 - v2)  = w , (t>i  - v2)  - v3 -jZ—  - o 

V\  > O , V2  > o , 1)3  > 0 . 


(8.16) 


(8.17) 


(8.18) 


Note  that  the  dual  vectors  Vi  and  v2  appear  only  as  components  of  the  difference  v\—v2.  Even  though  Vi  and 
v2  are  constrained  to  be  nonnegative,  their  difference  is  totally  unconstrained.  Defining  new,  unconstrained 
dual  variables 


v0  = Vi  - v2 


(8.19) 
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allows  the  constraints  to  be  written 


Kt  S 1v0  — w , vq  = v3 


V rTr 


, ^3  > 0 , 


and  eliminating  the  variable  vq  gives 


KT  S~lr 

v3 - w ,V3  > 0 . 

V rTr 


(8.20) 


(8.21) 


Note  that  r /VrTr  is  a unit  vector,  and  v3  is  simply  a normalization  parameter  used  to  scale  the  length  of 
Kt  S~1  times  this  vector  to  equal  the  length  of  w. 

The  objective  function  for  the  dual  problem  is 


f(z)-gT(z)v  = wTx  + 


•T  S~l  — xT  KT  S~x  — r1 


(vi  - v2)  + v3  ( V rTr  - p. 


= wTx  4- 


yT S 1 — xtKtS  1 — rT|  vq  + ^3  rT r — . 


rTS~lKx 


= ^3- 


+ (yT  S 1 — xT  Kt  S 1 — rT)  +v3(VrTr—y,) 

VrTr 


= ^3 


yTs-lr 
V rTr 


- M 


To  simplify,  we  introduce  a change  of  variables 


v3S  1r 
VrTr 


u = 


(8.22) 

(8.23) 

(8.24) 

(8.25) 

(8.26) 


and  note  that  the  variables  u are  constrained  only  by  KTu  = w.  Then  multiplying  both  sides  of  (8.26)  by 
S and  “squaring”  yields 

uT S2u  = U3  . (8.27) 

Now  we  can  write  the  dual  objective  function  as 


f(z)  - gT(z)v  - yT u - y.V uT S2u, 

yielding,  after  substitution  of  y,  — \/r2  + k2,  our  final  form  of  the  dual  problem 

ft0  = max  | uTy  - (r2  + k2)^  VuT  S3u  | uT  K = wT  } . 


(8.28) 


(8.29) 


In  a similar  manner,  it  can  be  shown  that  the  dual  problem  for  the  upper  bound  problem  in  (5.1)  can  be 
written 

(8.30) 


4>ur  = min  { uT y + (f2  + k2)  * VuT S3u  \ uT K = wT  } 


It  is  easy  to  independently  verify  that  these  two  dual  problems  are  equivalent  to  the  primals  (5.1)  because 
they  too  can  be  solved  by  Lagrange  multipliers.  The  solution  vectors  are 


ul°  - u + ^S^5'3(y-  Kx)  , 


'UP  _ - _ VuTS2U  g-3 


(8.31) 


u r = u — 


S-3(y-Ki ) , 


where  u is  given  by  (3.8)  and  x by  (3.11).  The  corresponding  optimal  values  are 


(j>l°  — ( ulo)Ty  — (f2  + /c2)  J yj (ulo)T S2u°  — uT y — k-\ZuT S2u  , 
4>up  = ( uup)Ty  + (f2  + /c2)^  yf(uup)TS2uup  = uTy  + kVuT S2u 


(8.32) 
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which  are  the  same  as  the  values  given  by  (4.7). 

The  primal  problems  (5.1)  correspond  to  a Gaussian  formulation  of  linear  interval  estimation  while  the 
corresponding  dual  problems  (8.29)  and  (8.30)  correspond  to  the  Markov  formulation.  In  the  objective 
functions  of  the  latter,  the  inner  product  uTy  is  a linear  estimator  of  the  unknown  <f>  and  the  quantity 
uT S~3u  is  the  variance  of  that  estimator.  The  constraint  functions  common  to  both  problems  require  the 
estimators  to  be  unbiased.  When  nonnegativity  constraints  are  imposed  on  the  problems,  the  forms  of  the 
objective  functions  are  unaffected,  but  the  constraint  functions  change  in  an  interesting  way. 


8.3.  Lower-biased  and  Upper-biased  Estimators 


Now  consider  the  nonnegatively  constrained  interval  estimation  problems  (8.3).  The  lower  bound  primal 
problem  is 

= min  { wT  x | (y  — Kx)T  S~2(y  — Kx ) < p2  , x > o } , (8.33) 

•E 

where 

M = \/ Pmin  + K2  • (8.34) 

The  transformation  of  this  problem  to  its  dual  form  is  similar  to  that  given  in  the  preceding  section  for  the 
unconstrained  problem  (8.8).  The  only  difference  is  the  additional  set  of  n constraints  x > o,  which  augment 
the  primal  constraint  relations  (8.12).  The  dual  vector  (8.16)  is  also  augmented  by  n additional  variables 
r4,  i.e., 

vT  = (v^  , , v3  , v%  )T  , (8.35) 

and  the  constraint  expressions  (8.18)  are  changed  to 


KT S l(t>i  - v2)  + v4  = w , 

Vl  > o , v2  > O , 


(v1-v2)-v3^T7  = 0 , 
V3  > 0 , U4  > O . 


(8.36) 


The  objective  function  remains  the  same  as  it  was  for  the  classical  problem,  and  the  definition  of  Vo  and 
the  determination  of  1*3  are  the  same  also.  The  net  result  is  that 

<f>l°  = max  < yT S~1vo  — fJ-\J v qVo  | KT S~1vq  + = w , > o > . (8.37) 

The  objective  function  does  not  depend  on  U4,  so  the  constraints  can  be  written  more  simply  as 

KtS~xv 0 < w . (8.38) 

Defining  u as  in  (8.26)  and  restoring  the  value  of  p,  gives 

4>l°  = max  | uTy  — (pmin  + k2)  3 V uT  S3u  | uT K < wT  | . (8.39) 

Similarly,  the  dual  statement  of  the  upper  bound  problem  can  be  shown  to  be 

0up  = min  | uTy  + (pmin  + k2)  3 V uT S3u  \ uT K > wT  | . (8.40) 

The  objective  functions  for  the  current  dual  problems  differ  from  those  of  the  classical  duals  (8.29)  and 

(8.30)  only  in  the  substitution  of  Pmin  for  r2.  The  constraints,  however,  are  quite  different.  In  the  classical 
case,  the  estimators  uTy  are  required  to  be  unbiased  in  both  cases.  For  the  nonnegatively  constrained 
problems,  the  estimator  for  the  lower  bound  is  required  to  be  lower  biased,  and  the  estimator  for  the  upper 
bound  is  required  to  be  upper  biased.  This  equivalence  of  bias  constraints  in  the  Markov  formulation  to 
nonnegativity  constraints  in  the  Gauss  formulation  is  a surprising  result  whose  significance  we  are  at  present 
unable  to  assess.  It  seems  very  logical  to  use  lower  biased  estimators  when  seeking  a lower  bound  and 
upper  biased  estimators  when  seeking  an  upper  bound,  but  such  constraints  appear  at  first  glance  to  be 
less  restrictive  than  the  classical  unbiasedness  constraints.  In  practical  problems,  nonnegatively  constrained 
confidence  intervals  are  smaller  than  the  corresponding  classical  intervals,  even  though  the  substitution  of 
Pmin  for  r2  in  the  objective  functions  should  tend  to  increase  the  size  of  the  former.  It  would  seem  then  that 
the  upper  and  lower  bias  constraints  are  considerably  stronger  than  the  unbiasedness  constraint. 
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8.4.  Proof  of  the  Burrus  Conjecture 

Having  established  the  equivalence  of  the  dual  problems  (8.39)  and  (8.40)  to  the  primals  (8.3),  it  is  not 
difficult  to  prove  the  Burrus  Conjecture  which  we  now  state  as  a theorem. 


Theorem.  Let  y be  a given  nvvector,  K a given  m X n matrix,  and  x an  unknown  n-vector  satisfying 

(8.41) 

and 


y = Kx  + e , e~  JV(o,  S2)  , 
Pr  { a:  > o } = 1 . 


Let  pm in  be  defined  by 


Pmi 


n = min  { (y  — K x)T  S 2(y—Kx ) | x > o } . 


Let  a be  a given  probability  (0  < a < 1),  and  suppose  that  the  value  /c  is  chosen  so  that 


i r+K 
\[2*  J-K 


exp  ( — - ) dr]  - a 


(8.42) 

(8.43) 

(8.44) 


If  w is  a given  rvvector,  and 

<t>l°  = min  { wTx  | (y  - Kx)T S~2(y  - Kx)  < pm  in  + k2  , x>o } , 

35 

<j>up  — max  { wT x | (y  — Kx)T  S~2(y  — Kx)  < pn,in  + k.2  , x > o } , 

35 

then 

Pr  | ft0  < wTx  < 4>up  } > Q . 

Proof.  For  any  nvvector  it,  consider  the  reduced  random  variable 

uT  y — uT  Kx 
77  ~~  VuTS3u 

which  follows  the  standard  normal  distribution  n(0,  1).  It  follows  that 

Pr  | [uT y — k.Vut  S 3 uj  < uT  Kx  < [uT y + k.V ut  S3uj  j = a . 
Define  constants  and  P2  by 

Pr  | (uTy  — kVut  S3u^j  < uT Kx  j = — J exP  dy  = a + P?  , 
Pr  | uT Kx  < (uTy  + nV uT  S3u^j  j = J exp  dy  = Pi  + a , 


and  note  that 

Pi  + a + p2  = 1 . 

We  saw  in  the  preceding  section  that  tf>l°  and  < ■j>up  can  also  be  defined  as 

<t>l°  — max  | uTy  — (pmin  + k2)  3 VuT  S3u  \ uT K < wT  j , 
4>up  = min  | itTy  + (pmi„  + «2) 3 y/ uT  S3u  | uT  K > wT  j . 

Consider  first  the  upper  bound  problem  and  let  t tuP  be  its  solution  vector,  i.e., 

4>UP  = ™lpy  + (Pmin  + K2)  3 \JUlpS3Uup  , 


(8.45) 

(8.46) 

(8.47) 

(8.48) 

(8.49) 

(8.50) 

(8.51) 

(8.52) 
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with 


ulvK  > wT  . 

From  the  second  relation  in  (8.49),  since  k < pm;n  + *c2,  it  follows  that 

(8.53) 

Pr{U«p“K’*  < [ Uupy+  (pmin  + K2)  ' ^ulvS3Uuv  J j > 01  + Q 

, (8.54) 

or,  by  (8.52),  that 

Pr  { uZPK*  < <t>up } > 0i  + q • 

(8.55) 

Since  x > o,  it  follows  from  (8.53)  that 

wT  x < u^pKx  . 

(8.56) 

Combining  this  inequality  with  (8.55)  gives 

Pr  | wT x < 4>up  | > 0i  + a . 

(8.57) 

In  a similar  manner,  it  is  easy  to  show  that 

Pr  | <f>l°  < wT  x | > a + 02  ■ 

(8.58) 

Now, 

Pr  | 4>l°  < wTx  < 0up  | = Pr  ||  (j>l°  < wTx  } D { < <t>up  } 

] 

- Pr  | 0l°  < wTx  | + Pr  | wT x < <f>up 

| (8.59) 

— Pr  1 1 <pl°  < wTx  | (J  | wTx  < if>up 

}]  • 

so  by  (8.57)  and  (8.58), 

Pr  | <t>l°  < wT x < 0up  | > 0i  + 2a  + 02  - Pr  1 1 <^io  < wTx  | U { wT x < <j>up  | j . (8.60) 

From  definition  (8.45),  we  have  0l°  < <j>up,  30 

Pr  [ { 4>l°  < wTx  } [J  | wT x < (f>up  ||=1, 

(8.61) 

which,  when  substituted  into  the  above  expression,  gives 

Pr  | <f>l°  < wT  x < <f>up  | > 0i  + 2q  + 02  - 1 . 

(8.62) 

It  follows  then  from  (8.50)  that 

Pr  | (j>l°  < wT  x < <t>up  | > a , 

(8.63) 

which  completes  the  proof  of  the  theorem.  □ 

Note  that  the  theorem  makes  no  restrictions  on  m,  n,  or  rank(if).  For  an  underdetermined  system  in 
which  K has  some  negative  elements,  it  is  possible,  but  not  very  probable,  that  one  of  the  two  problems 
(8.45)  may  be  unbounded.  In  that  case,  the  theorem  is  still  true,  but  the  confidence  interval  is  semi-infinite. 

In  proving  the  theorem,  we  assumed  that  the  errors  e were  normally  distributed.  This  is  generally  not 
a serious  limitation  in  practice,  but  it  is  worth  noting  that  the  theorem  can  be  proved  for  a wider  class 
of  possible  error  distributions.  The  essential  restriction  required  by  the  proof  is  that  the  reduced  random 
variable  t]  have  the  same  probability  distribution  for  all  m- vectors  u. 
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Corollary  1 . Let  y be  a given  m-vector,  K a given  m x n matrix,  and  x an  unknown  n-vector  satisfying 

y = Kx  + e , £(e)  — o , £ ^eeT^  = (8.64) 

and 


Pr{i>o}=l. 


(8.65) 


Let  pmin  be  defined  by  (8.43)  and  assume  that  the  errors  i are  distributed  in  such  a manner  that  the 
reduced  random  variable 

uTy  — uT  Kx 

1 = (8.66) 

vuJ  S3u 

has  the  same  probability  density  function  /(r?)  for  all  nvvectors  u.  Let  a be  a given  probability 
(0  < a < 1),  and  suppose  that  the  value  k is  chosen  so  that 


Pr 


/K. 

f{r))dr)  = 

• K 

If  w is  a given  n-vector,  and  if  </r°  and  <£up  are  defined  by  (8.45),  then 

Pr  | ft0  < wTx  < ]>up  } > 


a . 


a . 


(8.67) 


(8.68) 


Proof.  The  proof  differs  from  that  of  the  Theorem  only  in  the  definitions  of  0i  and  02,  i.e.,  Eqs.  (8.49)  are 
replaced  by 

/OO 

/(7?)  dr]  = a + , 

- K, 


Pr|wTffa5  < [uT y + kV ut  S3u^j  | = J f{ri)dri  = (3\  + a . 

The  values  and  02  satisfy  (8.50),  and  the  remainder  of  the  proof  follows  exactly  as  before.  □ 


(8.69) 


8.5.  Examples 

Consider  first,  confidence  intervals  based  on  the  normality  assumption.  Figure  9 gives  plots  of  95%  confidence 
intervals  for  both  Problems  I and  II  using  2-point  averaging.  The  Problem  I bounds  can  be  compared  with 
those  in  the  top  frame  of  Figure  4 in  order  to  assess  the  improvement  to  be  attributed  to  the  nonnegativity 
constraints.  Problem  I is  not  a bad  problem,  so  the  improvement  is  less  striking  than  that  normally  attained 
for  real-world  problems.  The  bounds  for  Problem  II  do  represent  a striking  improvement  since,  in  the  absence 
of  the  nonnegativity  constraints,  all  the  confidence  intervals  are  unbounded.  The  bounds  for  Problem  II  are 
unimpressive  when  compared  with  those  for  Problem  I,  but  they  could  be  significantly  improved  by  using 
3-point  averaging. 


9.  A Real-World  Problem 

We  conclude  with  a gamma-ray  unfolding  problem  suggested  by  a series  of  measurements  made  by  Faddegon, 
et  al.  [7].  The  data  from  these  experiments  have  been  archived  [5]  and  were  made  available  to  us  by  Dr. 
Faddegon.  In  one  experiment,  a beam  of  electrons  with  energies  tightly  clustered  around  15  MeV  was  incident 
on  the  end  of  a cylinder  of  lead  just  thick  enough  to  stop  them.  A sodium  iodide  scintillation  detector  coupled 
to  a photomultiplier  tube  was  used  to  measure  the  resulting  bremsstrahlung  photon  spectrum  at  an  angle 
of  60°  from  the  axis  of  the  cylinder.  In  this  problem,  the  variable  £ becomes  the  energy  £ of  the  photons 
incident  on  the  detector,  and  the  subscripts  i become  the  bin  numbers  of  the  pulse  height  bins  into  which 
the  photons  are  counted.  The  underlying  integral  equations  are 

OO 

R(El,£)Af{£)d£ + e\  , * = l,2,...,m,  (9.1) 


N(Ei)dEi  = dEi  f 
J t 
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150  X 121  Phillips  Problem 
One— at— a-time  Confidence  Itervals 
Using  Normal  Distribution,  a = 0.95 


£ 

108  X 121  Phillips  Problem 
One-at-a-time  Confidence  Intervals 
Using  Normal  Distirbution,  a = 0.95 
Nonneg.  constraints,  2 -Point  Averaging 


£ 

Figure  9:  Nonnegatively  constrained,  one-at-a-time  95%  confidence  intervals  from  the  normal  distribution. 
Dashed  lines  are  exact  solutions  x(£),  and  solid  lines  are  histograms  of  estimated  lower  and  upper  bounds. 
Top:  Problem  I,  2-point  averages.  Bottom:  Problem  II,  2-point  averages. 
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where  Eq  is  the  threshold  energy  for  detection,  Ex  is  the  central  energy  of  the  zth  pulse-height  bin,  dEi  is 
the  width  of  the  pulse-height  bin,  N(E{)dEi  is  the  total  number  of  photons  counted  in  the  bin,  A f{£)  is 
the  unknown  photon  spectrum  ( Af(£)d£  is  the  number  of  photons  in  an  energy  band  of  width  d£  centered 
on  energy  £ ),  R(Ei,£)dE{  is  the  instrument  response  function  ( the  probability  that  a photon  of  energy  £ 
will  produce  a count  in  the  zth  bin  ),  and  is  the  measurement  error.  The  £x  were  due  mostly  to  counting 
errors  and  thus  were  approximately  Poisson  distributed,  but  the  counts  were  also  corrected  for  background 
radiation  and  various  hardware  effects.  The  details  on  the  calculation  of  the  standard  deviations  sx  are  given 
in  [7].  We  assume  that  the  normal  distribution  n(0,$i)  gives  a sufficiently  accurate  approximation  to  the 
e't-distribution  and  that  the  errors  were  not  correlated  with  one  another.  Thus, 

e~jV(o,  S2)  , S = diag(si,s2,...,sm)  , (9.2) 

Discrete  approximations  to  the  detector  response  functions,  i.e., 

Rij  = R(Ei,£j)dEi  , J = 1, 2, . . .,  n , (9.3) 

were  determined  by  a combination  of  calculations  and  high-precision  measurements  described  in  [6].  Using 
those  values  to  approximate  the  system  (9.1)  gives 

n 

N(Ej)dEj  - RjtjAf(£j)A£j  + £ , i = l,2,...,m,  (9.4) 

j=i 

where  A £j  = £j  — £j~\.  For  the  particular  set  of  measurements  that  concerns  us  here,  the  values  of  m and 
n are  56  and  66  respectively.  It  was  necessary  to  choose  n > m because  the  response  functions,  which  are 
shown  in  Figures  10  and  11,  have  nonzero  values  outside  the  range  [£0,  £n].  Thus,  the  numbers  counted  in 
the  lower  bins  can  be  affected  by  photons  with  energies  less  than  that  of  the  lowest  bin  and  counts  in  the 
upper  bins  can  similarly  be  affected  by  photons  with  energies  higher  than  that  of  the  highest  bin.  These 
extra  columns  can  be  dropped  from  the  response  matrix  if  it  is  known  with  certainty  that  the  spectrum  of 
incident  photons  is  zero  at  the  corresponding  energies. 

The  measured  spectrum,  normalized  to  unity,  is  plotted  in  Figure  12  as  a histogram  with  two  horizontal 
lines  plotted  for  each  bin.  The  top  line  gives  the  measured  value  plus  one  standard  deviation  and  the  bottom 
line  gives  the  value  minus  one  standard  deviation.  Since  the  actual  value  cannot  be  negative,  it  is  possible 
to  compute  confidence  interval  estimates  for  the  true  spectrum  even  though  the  matrix  of  the  system  has 
more  columns  than  rows.  In  order  to  compute  95%  one-at-a-time  confidence  intervals  for  the  estimates  of 
the  actual  value,  we  took  ic  = 1.96  and  applied  the  BRACKET-LS  algorithm,  first  with  two-point  averaging 
and  then  with  no  averaging.  The  top  panel  of  Figure  13  gives  the  results  for  two-point  averaging.  Note  that 
the  averaging  gave  a bin  structure  similar  to  that  of  the  measured  spectrum.  The  maximum  of  the  spectrum 
occurs  in  the  bin  containing  the  .511  MeV  electron-positron  annihilation  energy.  The  pronounced  shoulder 
to  the  right  of  this  peak,  together  with  the  relatively  small  interval  widths  encouraged  us  to  repeat  the 
calculation  with  no  averaging.  The  results  are  shown  in  the  bottom  panel  of  the  Figure.  The  dark  vertical 
lines  represent  the  calculated  confidence  intervals,  and  the  dashed  line,  which  connects  the  non-negatively 
constrained  point  estimates,  was  added  only  as  a visual  aid.  With  the  given  data,  it  is  not  possible  to 
estimate  the  spectrum  at  the  intermediate  points,  but  the  results  that  were  obtained  suggest  that  higher 
energy  resolution  would  be  possible  if  the  response  functions  could  be  computed  on  a denser  £-mesh.  In 
particular,  there  is  a hint  that  a smaller  peak  might  be  resolved  in  the  vicinity  0.7  MeV . 
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Sodium  Iodide  Photon  Detector 
Instrument  Response  Functions 


Figure  10.  Response  functions  R+  j for  the  Nal  gamma-ray  detector.  For  clarity,  we  have  plotted  only  every 
2nd  ordinate  in  each  of  the  abscissa  directions.  To  show  the  structure  at  higher  energies,  we  have  plotted 

l°Sio(l  “h  ^*,j)  rather  than  RitJ  itself.  Note  the  finer  mesh  spacing  and  higher  energy  resolution  at  the  lower 
energies. 
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Response  Response 


Sodium  Iodide  Photon  Detector 


Selected  Response  Functions 
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Figure  11:  Cross-section  views  of  the  response  functions  Rij. 

Top:  Responses  for  pulse  height  bins  centered  on  Sj  = 0.182,  0.758,  1.945,  3.93,  6.88,  10.95,  16.96  MeV 
Bottom:  Responses  for  incident  energy  bins  centered  on  25*  = 0.546,  1.545,  3.29,  5.95,  9.70,  15.30  MeV  . 
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Normalized  Counts  (MeV) 


Measured  Bremsstrahlung  Spectrum 
15  MeV  Electrons  on  Pb  Cylinder 
Photon  Yield  at  60-deg.  Angle 


Figure  12:  Measured  pulse-height  spectrum  for  bremsstrahlung  yields  produced  at  60°  from  the  axis  by  15 
MeV  electrons  incident  on  the  end  of  the  Pb  cylinder.  The  two  levels  plotted  in  each  bin  are  ( dS/dE\  ± s,  . 
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Normalized  Counts  (MeV)  Normalized  Counts  (MeV) 


Unfolded  Bremsstrahlung  Spectrum 


Two— point  Averaging,  a = 0.95 


No  Averaging,  a = 0.95 


Figure  13:  Estimated  95%  one-at-a^time  confidence  intervals  for  the  actual  yields  ( dS/d£)j  . 

Top:  Average  values  over  the  intervals  £j  - £j_l  . 

Bottom:  Confidence  intervals  for  the  individual  ( dS/d£)j  . The  dashed  line  connects  the  point  estimates. 
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